 decision value


Confidence and Stability of Global and Pairwise Scores in NLP Evaluation

Levtsov, Georgii, Ustalov, Dmitry

arXiv.org Artificial Intelligence

With the advent of highly capable instruction-tuned neural language models, benchmarking in natural language processing (NLP) is increasingly shifting towards pairwise comparison leaderboards, such as LMSYS Arena, from traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench). This paper empirically investigates the strengths and weaknesses of both global scores and pairwise comparisons to aid decision-making in selecting appropriate model evaluation strategies. Through computational experiments on synthetic and real-world datasets using standard global metrics and the popular Bradley-Terry model for pairwise comparisons, we found that while global scores provide more reliable overall rankings, they can underestimate strong models with rare, significant errors or low confidence. Conversely, pairwise comparisons are particularly effective for identifying strong contenders among models with lower global scores, especially where quality metrics are hard to define (e.g., text generation), though they require more comparisons to converge if ties are frequent. Our code and data are available at https://github.com/HSPyroblast/srw-ranking under a permissive license.
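As a rough illustration of the pairwise side of this comparison, the sketch below fits Bradley-Terry strengths with the standard minorization-maximization update. It is a minimal sketch, not the authors' code: the function name `bradley_terry`, the win counts, and the choice to ignore ties are all assumptions made for the example.

```python
def bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths with the MM update.

    wins[i][j] is the number of times item i beat item j.
    Ties are ignored in this sketch (an assumption, not the paper's setup).
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iterations):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pair contributes (number of comparisons) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        p = [x / norm for x in new_p]  # normalize so strengths sum to 1
    return p


# Hypothetical win counts for three models A, B, C (10 comparisons per pair).
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
scores = bradley_terry(wins)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

The resulting `scores` are relative strengths, so only their ratios and ordering are meaningful.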


Positive region preserved random sampling: an efficient feature selection method for massive data

Bai, Hexiang, Li, Deyu, Liang, Jiye, Zhai, Yanhui

arXiv.org Artificial Intelligence

Selecting relevant features is an important and necessary step for intelligent machines to maximize their chances of success. However, intelligent machines generally do not have enough computing resources when faced with huge volumes of data. This paper develops a new method based on sampling techniques and rough set theory to address the challenge of feature selection for massive data. To this end, this paper proposes using the ratio of discernible object pairs to all object pairs that should be distinguished to measure the discriminatory ability of a feature set. Based on this measure, a new feature selection method is proposed. This method constructs positive region preserved samples from massive data to find a feature subset with high discriminatory ability. Compared with other methods, the proposed method has two advantages. First, it is able to select a feature subset that can preserve the discriminatory ability of all the features of the target massive data set within an acceptable time on a personal computer. Second, a lower bound on the probability that an object pair that should be distinguished can be discerned using the selected feature subset can be estimated before finding reducts. Furthermore, 11 data sets of different sizes were used to validate the proposed method. The results show that approximate reducts can be found in a very short period of time, and the discriminatory ability of the final reduct is larger than the estimated lower bound. Experiments on four large-scale data sets also showed that an approximate reduct with high discriminatory ability can be obtained in reasonable time on a personal computer.
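The discriminatory-ability measure described above can be sketched as follows. This is an illustrative reading, not the paper's implementation: it assumes that the pairs "that should be distinguished" are the pairs of objects with different decision values, and the function name `discernibility_ratio` and the toy table are hypothetical.

```python
from itertools import combinations


def discernibility_ratio(rows, decisions, features):
    """Fraction of must-distinguish pairs that the given features can tell apart.

    Assumption: a pair must be distinguished when its decision values differ.
    """
    must_distinguish = 0
    discerned = 0
    for a, b in combinations(range(len(rows)), 2):
        if decisions[a] == decisions[b]:
            continue  # same decision: nothing to distinguish
        must_distinguish += 1
        if any(rows[a][f] != rows[b][f] for f in features):
            discerned += 1
    return discerned / must_distinguish if must_distinguish else 1.0


# Hypothetical decision table: three condition features, one decision.
rows = [
    ("sunny", "hot", "high"),
    ("sunny", "hot", "low"),
    ("rainy", "mild", "high"),
    ("rainy", "hot", "high"),
]
decisions = ["no", "yes", "yes", "no"]

ratio_first_only = discernibility_ratio(rows, decisions, [0])
ratio_subset = discernibility_ratio(rows, decisions, [1, 2])
ratio_all = discernibility_ratio(rows, decisions, [0, 1, 2])
```

In this toy table the subset `[1, 2]` reaches the same ratio as the full feature set, which is the kind of subset a reduct search looks for.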


On rough mereology and VC-dimension in treatment of decision prediction for open world decision systems

Polkowski, Lech T.

arXiv.org Artificial Intelligence

Given raw knowledge in the form of a data table (a decision system), one faces two possible avenues: either to treat the system as closed, i.e., its universe does not admit new objects, or, on the contrary, to treat its universe as open to the admittance of new objects. In particular, one may obtain new objects whose sets of feature values are new to the system. In this case the problem is to assign a decision value to any such new object. This problem is addressed to some extent in rough set theory, e.g., on the basis of the similarity of the value set of a new object to the value sets of objects already assigned a decision value. It is crucial for online learning, where each new object must have a predicted decision value. There is a vast literature on various methods of decision prediction for new, yet unseen objects. The approach we propose is founded on the theory of rough mereology and it requires a theory of sets/concepts, and we root our theory in the classical set theory of Syllogistic, within which we recall the theory of parts known as Mereology. Then, we recall our theory of Rough Mereology along with the theory of weight assignment to the Tarski algebra of Mereology. This allows us to introduce the notion of a part to a degree. Once we have defined the basics of Mereology and rough Mereology, we recall our theory of weight assignment to elements of the Boolean algebra within Mereology; this allows us to define the relation of a part to a degree, and we apply this notion in a procedure to select a decision for new, yet unseen objects. In selecting a plausible candidate which would pass its decision value to the new object, we employ the notion of Vapnik-Chervonenkis dimension in order to select at the first stage the candidate with the largest VC-dimension of the family of its $\varepsilon$-components for some choice of $\varepsilon$.


Distilling Model Failures as Directions in Latent Space

Jain, Saachi, Lawrence, Hannah, Moitra, Ankur, Madry, Aleksander

arXiv.org Artificial Intelligence

The composition of the training dataset has key implications for machine learning models' behavior [Fel19; CLK+19; KL17; GZ19; IPE+22], especially as the training environments often deviate from deployment conditions [RGL19; KSM+20; HBM+20]. For example, a model might struggle on specific subpopulations in the data if that subpopulation was mislabeled [NAM21; SC18; BHK+20; VCG+22], underrepresented [SKH+20; STM21], or corrupted [HD19; HBM+20]. More broadly, the training dataset might contain spurious correlations, encouraging the model to depend on prediction rules that do not generalize to deployment [XEI+20; GJM+20; DJL21]. Moreover, identifying meaningful subpopulations within data allows for dataset refinement (such as filtering or relabeling) [YQF+19; SC18], and training more fair [KGZ19; DYZ+21] or accurate [JFK+20; SHL20] models. However, dominant approaches to such identification of biases and difficult subpopulations within datasets often require human intervention, which is typically labor intensive and thus not conducive to routine usage.


On the Use of Unrealistic Predictions in Hundreds of Papers Evaluating Graph Representations

Lin, Li-Chung, Liu, Cheng-Hung, Chen, Chih-Ming, Hsu, Kai-Chin, Wu, I-Feng, Tsai, Ming-Feng, Lin, Chih-Jen

arXiv.org Artificial Intelligence

Prediction using the ground truth sounds like an oxymoron in machine learning. However, such an unrealistic setting was used in hundreds, if not thousands, of papers in the area of finding graph representations. To evaluate the multi-label problem of node classification by using the obtained representations, many works assume in the prediction stage that the number of labels of each test instance is known. In practice such ground truth information is rarely available, but we point out that such an inappropriate setting is now ubiquitous in this research area. We investigate in detail why the situation occurs. Our analysis indicates that with unrealistic information, the performance is likely over-estimated. To see why suitable predictions were not used, we identify difficulties in applying some multi-label techniques. For use in future studies, we propose simple and effective settings that do not rely on practically unknown information. Finally, we take this chance to conduct a fair and serious comparison of major graph-representation learning methods on multi-label node classification.
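The setting criticized above can be illustrated with a small sketch. The first function uses the true number of labels of a test instance to choose how many labels to predict; the second is a generic thresholding rule that uses only the decision values. Both function names, the scores, and the threshold of 0 are assumptions for illustration, and the thresholding rule is not claimed to be the setting the authors propose.

```python
def predict_top_k(decision_values, k):
    """Unrealistic setting: k is the TRUE label count of the test instance."""
    ranked = sorted(
        range(len(decision_values)),
        key=lambda j: decision_values[j],
        reverse=True,
    )
    return set(ranked[:k])


def predict_threshold(decision_values, threshold=0.0):
    """Generic alternative: keep every label whose decision value exceeds a threshold."""
    return {j for j, s in enumerate(decision_values) if s > threshold}


# Hypothetical per-label decision values for one test node with true labels {0, 1}.
decision_values = [1.2, 0.4, -0.3, 0.1]
true_labels = {0, 1}

with_ground_truth = predict_top_k(decision_values, k=len(true_labels))
without_ground_truth = predict_threshold(decision_values)
```

Here the prediction that peeks at the label count matches the truth exactly, while the thresholded prediction also includes a false positive, which is one way the reported performance can be inflated.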


Programming by Rewards

Natarajan, Nagarajan, Karthikeyan, Ajaykrishna, Jain, Prateek, Radicek, Ivan, Rajamani, Sriram, Gulwani, Sumit, Gehrke, Johannes

arXiv.org Artificial Intelligence

We formalize and study "programming by rewards" (PBR), a new approach for specifying and synthesizing subroutines for optimizing some quantitative metric such as performance, resource utilization, or correctness over a benchmark. A PBR specification consists of (1) input features $x$, and (2) a reward function $r$, modeled as a black-box component (which we can only run), that assigns a reward for each execution. The goal of the synthesizer is to synthesize a "decision function" $f$ which transforms the features to a decision value for the black-box component so as to maximize the expected reward $E[r \circ f (x)]$ for executing decisions $f(x)$ for various values of $x$. We consider a space of decision functions in a DSL of loop-free if-then-else programs, which can branch on linear functions of the input features in a tree structure and compute a linear function of the inputs in the leaves of the tree. We find that this DSL captures decision functions that are manually written in practice by programmers. Our technical contribution is the use of continuous-optimization techniques to perform synthesis of such decision functions as if-then-else programs. We also show that the framework is theoretically founded: in cases when the rewards satisfy nice properties, the synthesized code is optimal in a precise sense. We have leveraged PBR to synthesize non-trivial decision functions related to search and ranking heuristics in the PROSE codebase (an industrial-strength program synthesis framework) and achieve results competitive with manually written procedures tuned over multiple man-years. We present an empirical evaluation against other baseline techniques over real-world case studies (including PROSE) as well as on simple synthetic benchmarks.
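To make the shape of the problem concrete, the sketch below shows one loop-free if-then-else decision function of the kind described (a linear branch condition with linear leaves) and a sample-based estimate of the expected reward. It does not perform synthesis. The coefficients, the function names, and the stand-in reward are all hypothetical; in PBR the reward is a black box that can only be run.

```python
def decision_function(x):
    """A hand-written member of the DSL: branch on a linear test, return a linear leaf."""
    if 2.0 * x[0] - x[1] > 1.0:
        return 0.5 * x[0] + 1.0
    return -1.0 * x[1] + 3.0


def black_box_reward(x, decision):
    """Hypothetical stand-in for the black-box reward: best when the decision is 2.0."""
    return -abs(decision - 2.0)


def estimate_expected_reward(f, reward, inputs):
    """Monte Carlo estimate of E[r(f(x))] over a sample of inputs."""
    return sum(reward(x, f(x)) for x in inputs) / len(inputs)


# Hypothetical benchmark inputs (feature vectors).
inputs = [(1.0, 0.0), (0.0, 1.0)]
decisions = [decision_function(x) for x in inputs]
expected_reward = estimate_expected_reward(decision_function, black_box_reward, inputs)
```

A synthesizer would search over the branch and leaf coefficients to increase `expected_reward`; here they are fixed by hand.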


NeCPD: An Online Tensor Decomposition with Optimal Stochastic Gradient Descent

Anaissi, Ali, Suleiman, Basem, Zandavi, Seid Miad

arXiv.org Machine Learning

Multi-way data analysis has become an essential tool for capturing underlying structures in higher-order datasets stored in a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \dots \times I_N}$. CANDECOMP/PARAFAC (CP) decomposition has been extensively studied and applied to approximate $\mathcal{X}$ by $N$ loading matrices $A^{(1)}, \dots, A^{(N)}$, where $N$ represents the order of the tensor. We propose a new efficient CP decomposition solver named NeCPD for non-convex problems in multi-way online data, based on the stochastic gradient descent (SGD) algorithm. SGD is very useful in the online setting since it allows us to update $\mathcal{X}^{(t+1)}$ in one single step. In terms of global convergence, it is well known that SGD can get stuck at saddle points when it deals with non-convex problems. We study the Hessian matrix to identify these saddle points, and then try to escape them using the perturbation approach, which adds a little noise to the gradient update step. We further apply Nesterov's Accelerated Gradient (NAG) method in the SGD algorithm to optimally accelerate the convergence rate and compensate for the Hessian computational delay time per epoch. Experimental evaluation in the field of structural health monitoring using laboratory-based and real-life structural datasets shows that our method provides more accurate results compared with existing online tensor analysis methods.
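The idea of perturbing the gradient step to leave a saddle point, combined with Nesterov momentum, can be sketched on a toy two-variable function rather than on a CP decomposition. This is only an illustration under stated assumptions: the objective $f(x, y) = x^2 + y^4/4 - y^2/2$ (saddle at the origin, minima at $y = \pm 1$), the step size, the momentum, and the noise scale are all chosen for the example, noise is added at every step, and no Hessian test is performed.

```python
import random


def grad(x, y):
    """Gradient of the toy objective f(x, y) = x**2 + y**4 / 4 - y**2 / 2."""
    return 2.0 * x, y ** 3 - y


def nesterov_descent(start, noise_scale, steps=500, lr=0.1, momentum=0.9, seed=0):
    """Nesterov accelerated gradient with optional Gaussian noise on the gradient."""
    rng = random.Random(seed)
    x, y = start
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        # Evaluate the gradient at the look-ahead point (Nesterov's trick).
        gx, gy = grad(x + momentum * vx, y + momentum * vy)
        # Perturb the gradient; with noise_scale == 0 this is plain NAG.
        gx += noise_scale * rng.gauss(0.0, 1.0)
        gy += noise_scale * rng.gauss(0.0, 1.0)
        vx = momentum * vx - lr * gx
        vy = momentum * vy - lr * gy
        x, y = x + vx, y + vy
    return x, y


# Starting exactly on the line y = 0, the unperturbed method slides into the saddle.
plain = nesterov_descent((0.5, 0.0), noise_scale=0.0)
perturbed = nesterov_descent((0.5, 0.0), noise_scale=0.01)
```

Without noise the iterate never leaves the line $y = 0$ and ends at the saddle, while a small perturbation lets it drift toward one of the two minima.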


Online Tensor-Based Learning for Multi-Way Data

Anaissi, Ali, Suleiman, Basem, Zandavi, Seid Miad

arXiv.org Machine Learning

The online analysis of multi-way data stored in a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \dots \times I_N}$ has become an essential tool for capturing the underlying structures and extracting the sensitive features which can be used to learn a predictive model. However, data distributions often evolve with time and a current predictive model may not be sufficiently representative in the future. Therefore, incrementally updating the tensor-based features and model coefficients is required in such situations. A new efficient tensor-based feature extraction method, named NeSGD, is proposed for online CANDECOMP/PARAFAC (CP) decomposition. According to the new features obtained from the resultant matrices of NeSGD, a new criterion is triggered for the update process of the online predictive model. Experimental evaluation in the field of structural health monitoring using laboratory-based and real-life structural datasets shows that our methods provide more accurate results compared with existing online tensor analysis and model learning. The results showed that the proposed methods significantly improved the classification error rates, were able to assimilate the changes in the positive data distribution over time, and maintained a high predictive accuracy in all case studies.


Identifying and Compensating for Feature Deviation in Imbalanced Deep Learning

Ye, Han-Jia, Chen, Hong-You, Zhan, De-Chuan, Chao, Wei-Lun

arXiv.org Machine Learning

In practice, however, we frequently encounter training data with a class-imbalanced distribution. For example, modern real-world large-scale datasets often have the so-called long-tailed distribution: a few major classes claim most of the instances, while most of the other minor classes are represented by relatively fewer instances [16, 31, 38, 50, 51, 61]. Classifiers trained with this kind of dataset using conventional strategies (e.g., mini-batch SGD on uniformly sampled instances) have been found to perform poorly on minor classes [3, 19, 40, 52], which is particularly unfavorable if we evaluate the classifiers with class-balanced test data or average per-class accuracy. One common explanation for the poor performance is the ...

Figure 1: Over-fitting to minor classes and feature deviation: (top-left) the number of training (red) and test (blue) instances per class of an imbalanced CIFAR-10 [8, 32]; (top-right) the training and test set accuracy per class using a ResNet [20]; (bottom) the t-SNE [41] plot of the training (circle) and test (cross) features before the last linear classifier layer. We see a trend of over-fitting to minor classes, which results from the feature deviation of training and test instances (see the magenta and red minor classes).
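The difference between overall accuracy and average per-class accuracy mentioned above can be seen in a small worked example. The labels and predictions are hypothetical, and the helper names are chosen for this sketch only.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of all instances predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def average_per_class_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies, so every class counts equally."""
    per_class = []
    for c in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in indices if y_pred[i] == c)
        per_class.append(hits / len(indices))
    return sum(per_class) / len(per_class)


# Hypothetical imbalanced test set: 8 instances of major class 0, 2 of minor class 1.
y_true = [0] * 8 + [1] * 2
# A classifier that always predicts the major class.
y_pred = [0] * 10

acc = overall_accuracy(y_true, y_pred)
balanced_acc = average_per_class_accuracy(y_true, y_pred)
```

The always-major classifier looks acceptable under overall accuracy but scores no better than chance on the two classes when each class is weighted equally.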


Everything old is new again: A multi-view learning approach to learning using privileged information and distillation

Wang, Weiran

arXiv.org Machine Learning

Transferring knowledge learned by a powerful model ("teacher") to a simpler model ("student") has become a theme in machine learning. The goal of the knowledge transfer is to have the teacher guide the learning process of the student, so as to achieve high prediction accuracy, or to reduce the sample complexity, which are otherwise hard for the student to achieve by itself. This learning paradigm is practically useful when it is necessary to deploy simpler models to real-world systems, which require a small memory footprint or fast processing time. We focus on two specific settings of knowledge transfer in this work. The first one is learning using privileged information (LUPI) [Vapnik and Vashist, 2009], in which the teacher provides an additional set of feature representations to the student during its training process but not at test time, and the extra feature set contains richer information to make the learning problem easier for the student; an example is that the "student may normally only have access to the image of a biopsy to predict the existence of cancer, but during the training process, it also has access to the medical report of an oncologist" [Lopez-Paz et al., 2015]. The second setting is distillation [Ba and Caruana, 2014, Hinton et al.,